High-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning

ABSTRACT

A parser is deployed early in a machine learning pipeline to read raw data and collect useful statistics about the raw data's content to determine which items of raw data exhibit a proxy for feature importance for the machine learning model. The parser operates at high speeds that approach the disk's absolute throughput while utilizing a small memory footprint. Utilization of the parser enables the machine learning pipeline to receive a fraction of the total raw data that would otherwise be available. Several scans through the data are performed, by which proxies for feature importance are indicated and irrelevant features may be discarded and thereby not forwarded to the machine learning pipeline. This reduces the amount of memory and other hardware resources used at the server and also expedites the machine learning process.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. Ser. No. 16/408,764, filed May 10, 2019, entitled, “HIGH-SPEED SCANNING PARSER FOR SCALABLE COLLECTION OF STATISTICS AND USE IN PREPARING DATA FOR MACHINE LEARNING”.

BACKGROUND

Machine learning approaches developed across industries (e.g., in commercial and academic organizations) typically utilize an entire dataset loaded in memory to train the machine learning model. Companies with large datasets quickly hit the limits of the amount of memory that a single server can be equipped with when developing large-scale machine learning operations. Even with continuous investments in hardware, the problem remains that the volume of data scales disproportionately to the investments.

Some solutions utilize distributed computing technologies (e.g., Apache Spark™) which can be a costly endeavor due to the increased complexity associated with using multiple machines, including the networking and the overhead associated with constant monitoring and maintenance. Other solutions include manually removing parts of data so that the entire dataset can fit in memory on a single server, but this approach can be time-consuming and can lead to a less performant predictive model.

SUMMARY

A parser is deployed early in a machine learning pipeline to read raw data and collect useful statistics about the raw data's content to determine which pieces of the raw data to feed into the remainder of the pipeline. The parser operates at high speeds that approach the disk's absolute throughput while utilizing a small memory footprint. Utilization of the parser enables the machine learning pipeline to receive a fraction of the total raw data, namely the portion that exhibits information serving as proxies for feature importance to the machine learning model. This reduces the necessary amount of memory and other hardware resources used subsequently in the pipeline and also expedites the machine learning process.

Multiple stages are utilized by the parser to create a catalogue of data characteristics, including proxies for feature importance, for loading into the machine learning pipeline. The data characteristics can be, for example, a summary of the raw data which is placed into the catalogue. The raw data from one or more files in some tabular format with demarcations between items (e.g., comma-separated values (CSV)) is ingested into the computing device. The ingested raw data is scanned several times, in which each scan results in some analysis or processing of the raw data. The raw data may be scanned in byte format to increase the processing speed of the parser.

During a type scanning stage, the raw data is scanned to determine a type for each column (e.g., integer, floating-point number, text string, date, etc.) and a total number of rows in the file. Based on that scan, a catalogue is constructed of pre-allocated arrays. The catalogue collects online statistic objects for each column in the raw data, such as prevalence, variance, mean, etc. The construction of the catalogue also includes flags for each column (e.g., the column type and/or subtype) which identify missing cell values, a placeholder for a count of missing values, and the like. The contents of the catalogue may be accessed by index to increase processing speed. Constructing the catalogue with the necessary flags and online statistic objects enables the parser to reserve all necessary memory up front for subsequent processing, thereby avoiding subsequent memory allocation operations.
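
As a rough illustration of this stage, the following sketch infers a coarse type per column and counts rows in a single pass. It is not taken from the disclosure; the function name, the simple integer/float/string classification, and the assumption of a well-formed, comma-delimited file with a header row are all illustrative, and missing-value and date handling are omitted.

```julia
# Illustrative type scan: infer a coarse type per column and count rows.
# Assumes a well-formed, comma-delimited file with a header row.
function type_scan(path::AbstractString)
    coltypes = Symbol[]               # :int, :float, or :string per column
    nrows = 0
    open(path, "r") do io
        header = split(readline(io), ',')
        coltypes = fill(:int, length(header))
        for line in eachline(io)
            nrows += 1
            for (j, cell) in enumerate(split(line, ','))
                if coltypes[j] == :int && tryparse(Int, cell) === nothing
                    coltypes[j] = :float
                end
                if coltypes[j] == :float && tryparse(Float64, cell) === nothing
                    coltypes[j] = :string
                end
            end
        end
    end
    return coltypes, nrows
end
```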

During a second scan, the raw data is parsed for delimiters only (e.g., commas or other separators between items) to identify the largest gap between any two delimiters within the raw data. While other scans may identify the delimiters to parse content, the second scan of the raw data is focused solely on the delimiter locations. The parsing may be performed on the bytes of raw data to expedite the processing. A pre-allocated buffer is created in memory based on the largest identified gap (the item with the most characters). That is, a size of the pre-allocated buffer corresponds to the size of the largest identified item so that the buffer can accommodate each item of raw data in the file. In some scenarios, during that second scan the parser performs a label distribution process in which rows of data are assigned as a testing set or a training set. During subsequent processing, rows of data labeled as a training set are submitted to the online model for utilization and rows of testing data are not submitted to the online model.

During a third scan, the raw data is parsed into the constructed catalogue, one item at a time. Each item in the raw data (e.g., items between delimiters) is individually placed, one byte at a time, into the pre-allocated buffer. When the item is complete in the buffer, an action is performed on the respective buffer contents. The pre-allocated buffer is reused for each item in the raw data such that additional memory buffer allocations are unnecessary. In typical scenarios, the content associated with the items, when placed into the buffer, is assembled into a number and that number is pushed into an online statistic object within the catalogue.

After parsing each item of the raw data into the catalogue, the populated catalogue with the parsed items may indicate which items exceed a threshold of importance over other items within the catalogue, which indicates proxies of feature importance. The raw data may now be submitted in a tabular data structure for ingestion and processing by the machine learning pipeline, with the catalogue used as a reference to assess which pieces of raw data to load into memory and provide to the model. The catalogue may be kept in one structure that enables fast updating while scanning and subsequently changed into a different structure for easier manipulation. While the contents of the catalogue identify proxies for feature importance, the machine learning pipeline may further refine the feature importance for the final predictive model generated.

Implementing the parser early in the machine learning pipeline allows larger data sets to be used for training on a single server and, in a distributed context, leads to lower usage of cluster resources and allows more teams to perform more projects at once and at a faster rate. The parser, therefore, amplifies the investment in hardware by increasing scalability and reducing the number of hardware components necessary to process large sets of raw data in the machine learning model.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure. It will be appreciated that the above-described subject matter may be implemented as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as one or more computer-readable storage media. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows illustrative flowcharts in which the parser is utilized early in the machine learning pipeline;

FIG. 2 shows an illustrative diagram in which the raw data is in some tabular format;

FIG. 3 shows an illustrative diagram in which the raw data is separated by delimiters;

FIG. 4 shows an illustrative diagram in which the raw data is scanned to identify type and number of columns;

FIG. 5 shows an illustrative diagram in which a catalogue is constructed with an allocated memory footprint;

FIG. 6 shows an illustrative diagram in which the raw data is parsed for delimiters;

FIG. 7 shows an illustrative environment in which the raw data is parsed for label distribution into training or testing sets;

FIG. 8 shows an illustrative diagram in which the raw data is placed into a pre-allocated buffer and parsed into the catalogue;

FIG. 9 shows an illustrative graph in which a proxy for feature importance of the raw data is indicated by the parsing process, which is passed on to the machine learning model for utilization;

FIG. 10 shows an illustrative diagram in which the data in the catalogue is loaded into the machine learning pipeline;

FIG. 11 shows an illustrative table which is used to look up bytes within the raw data;

FIG. 12 shows a graph in which the parser utilizes a lower memory footprint than other methods of processing the data;

FIGS. 13-15 show flowcharts of illustrative methods performed by a computing device, server, or the like;

FIG. 16 is a simplified block diagram of an architecture of an illustrative computing device that may be used at least in part to implement the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning; and

FIG. 17 is a simplified block diagram of an illustrative computing device, remote service, or computer system that may be used in part to implement the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning.

Like reference numerals indicate like elements in the drawings. Elements are not drawn to scale unless otherwise indicated.

DETAILED DESCRIPTION

FIG. 1 shows a simplified machine learning pipeline 105 to which data is ingested and processed by one or more computing devices 110 (e.g., a server) to generate a model for predictive analysis. In simplified form, the machine learning pipeline includes the steps of raw data ingestion 115, preparation 120, model training 125, and predictions 130. Raw data may be ingested in step 115, in which the data may be in some tabular format (e.g., comma-separated values (CSV)). During preparation 120, the data may be prepared for use in machine learning training. The data may be randomized, to reduce the possibility of an order affecting the machine learning process, and separated between a training set for training the model and a testing set for testing the trained model. Other forms of data manipulation may be performed as well, such as normalization, error correction, and the like.

The prepared data may then be used for model training in step 125. The model training may be used to incrementally improve the model's ability to make accurate predictions. The model training may use the features contained in the data to form a matrix with weights and biases against the data. Random values within the data may be utilized to attempt prediction of the output based on those values. This process may repeat until a more accurate model is developed which can predict correct outputs. The model may subsequently be evaluated to determine if it meets some accuracy threshold (e.g., 70% or 80% accuracy), and then predictions will be performed at step 130. In this step, a question can be posed to the computing device 110 on which the model operates, and the computing device can provide an answer using the developed model.

The parser 165 may be utilized early in the machine learning pipeline 105 to identify and load data with information proxies for feature importance to the pipeline to expedite the processing and reduce the amount of memory utilized when processing the data set. The proxy for feature importance may be data that surpasses some threshold of importance relative to other items. As discussed in greater detail below, the parser includes several stages which include type scanning 135, catalogue construction 140, parsing for delimiters 145, parsing for label distribution 150, parsing data into the catalogue 155, and data loading for model training 160. The parsing for label distribution stage 150 is optional and may or may not be implemented depending on the configuration.

FIG. 2 shows an illustrative diagram in which the raw data files 205 ingested into the machine learning pipeline may be in some tabular format in which the demarcation between items is documented. The tabular format may, for example, correspond to a table having rows and columns with cells which represent the content. An exemplary tabular format can include CSV (comma-separated values) as illustratively shown by numeral 210.

FIG. 3 shows an example of the raw data 205 for a file 305 in .csv format 310. The delimiter in a .csv file is a comma which separates items 315 of content from each other. The items may include, for example, the content of the data types under which the content is located.

FIG. 4 shows an illustrative diagram in which the raw data file 205 is scanned during the type scanning stage 135. Several scans are performed on the raw data file by the parser 165 (FIG. 1), in which each scan results in some processing of the data. The scan 405 performed during type scanning may be the first scan on the raw data file in the process, during which the parser identifies (410) a type 415 of each column in the raw data and a total number 420 of rows within the raw data. The type 415 of each column can include whether the content within the columns is an integer, a floating-point number, a text string (e.g., “New York”), a date (e.g., a recognizable date format which the parser can convert into an appropriate numerical value), and the like. The identified types and number are stored in an array. The table having columns and rows in FIG. 4 is an example structure of demarcated raw data.

FIG. 5 shows an illustrative diagram of the operations during the catalogue construction stage 140. The catalogue is constructed based on the information that the parser 165 obtains from the scan 405 (FIG. 4). The construction of the catalogue creates an allocated memory footprint 530 made of pre-allocated arrays 535 so that the amount of memory utilized and allocated by the parser is set. That is, the memory footprint is the maximum amount of memory utilized during parsing of the raw data and therefore additional memory allocation operations are unnecessary. Minimizing memory allocations can lead to significant performance increases when processing the data.

As illustratively represented by numeral 505, the parser assesses a size of the allocated memory footprint for catalogue construction. This assessment is performed by collecting online statistic objects for each column (e.g., prevalence, variance, mean, etc.) 510 and identifying flags for each column 515 (e.g., column type and subtype). The flags may include missing cell values 520 and a placeholder for a count of missing values 525, among other flags. For example, negative signs for negative numbers and decimal points for floating-point numbers can be flagged as well, in which the flag can be used to modify the processed cell.

Performing this pre-processing of the data enables the parser to reserve a stable amount of memory in which all subsequent operations can take place. The online statistic objects and flags may be stored in respective pre-allocated arrays. Online statistic objects for each column can accept data from a respective column's population in any order and at different times, and then produce statistics that reflect the whole population. The catalogue contents may be accessed by index to increase processing speed.

The online statistic objects can be calculated in “streaming” fashion, that is, without necessarily requiring all the data in the same place at the same time and without storing the raw data. Online statistics can be collected on pieces of the raw data at different times or on different computers and can be merged together to determine the whole population. Because online statistics do not store the raw data points, implementations in computer programs have consistent, constant, and predictable resource usage and processing times.
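
As one concrete illustration of such an object (a sketch only; the type name, fields, and the choice of mean and variance are assumptions rather than a description of the actual implementation), a running count, mean, and variance can be maintained with Welford's method without retaining any raw data points:

```julia
# Illustrative online statistic: running count, mean, and variance
# maintained without storing the raw data points (Welford's method).
mutable struct OnlineStat
    n::Int         # number of values seen so far
    mean::Float64  # running mean
    m2::Float64    # running sum of squared deviations from the mean
end
OnlineStat() = OnlineStat(0, 0.0, 0.0)

function push_value!(s::OnlineStat, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

variance(s::OnlineStat) = s.n > 1 ? s.m2 / (s.n - 1) : 0.0
```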

Multiple catalogues can be merged together as long as the raw data is structured in the same way (i.e., has the same columns). Statistics can be collected from different portions of the data set hosted on different machines, or on data that arrives subsequently in time. This facility enables the parser 165 to be used in parallel contexts, on distributed systems, or for federated analytics.
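
Continuing the hypothetical OnlineStat sketch above, two such objects computed on different portions of the data could be combined with the standard parallel formula for mean and variance; again, this illustrates the merging idea rather than the disclosed implementation:

```julia
# Illustrative merge of two online statistics collected on different
# pieces of the data (e.g., different files, threads, or machines).
function merge_stats(a::OnlineStat, b::OnlineStat)
    n = a.n + b.n
    n == 0 && return OnlineStat()
    delta = b.mean - a.mean
    mean  = a.mean + delta * b.n / n
    m2    = a.m2 + b.m2 + delta^2 * a.n * b.n / n
    return OnlineStat(n, mean, m2)
end
```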

FIG. 6 shows an illustrative diagram in which a second scan 605 is performed on the raw data 205 to parse for delimiters 145. The scan is performed on the data in byte format to expedite processing of the data, as illustratively shown by numeral 625. During this stage, the parser 165 identifies delimiters in step 610. Delimiters can include some demarcation (e.g., commas) between items of data. In step 615, the parser identifies a largest cell (i.e., item of data between delimiters) within the raw data. In step 620, the parser pre-allocates a buffer based on the size of the largest identified cell of data.

The pre-allocated buffer 630 is utilized during subsequent processing to hold, process, and assemble the raw data contents (e.g., parts of the item) into the catalogue. No additional memory allocation operations are necessary beyond this parsing stage since a size of the buffer is pre-allocated because the largest possible cell has been identified. The pre-allocated array is repeatedly overwritten with data during subsequent processing, rather than being de-allocated and re-allocated. This configuration facilitates high-speed processing with a small and constant memory footprint for the parser and minimizes calls to the garbage collector if such is used by the implementation language. An example of the data in byte format (ASCII decimal) is shown, in which the commas (delimiters) are represented by the number 44 and the largest cell of data is emphasized in bold and underline.
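
A minimal sketch of this delimiter-only pass follows; it assumes comma delimiters (byte 44) and newline row ends (byte 10), ignores quoting, and reads one byte at a time for clarity (a practical implementation would read the stream in chunks). The function and file names are illustrative.

```julia
# Illustrative second scan: look only at delimiter bytes and record the
# widest gap between them, so a single reusable buffer can be sized once.
function largest_cell(path::AbstractString)
    widest  = 0
    current = 0
    open(path, "r") do io
        while !eof(io)
            b = read(io, UInt8)
            if b == UInt8(',') || b == UInt8('\n')   # 44 and 10 in ASCII decimal
                widest  = max(widest, current)
                current = 0
            else
                current += 1
            end
        end
    end
    return max(widest, current)
end

buffer = Vector{UInt8}(undef, largest_cell("data.csv"))  # pre-allocated once, reused per cell
```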

FIG. 7 shows an illustrative diagram of an optional step in which the raw data is parsed for label distribution, as illustratively shown by numeral 150. During the second scan of the raw data, the parser performs online modeling 705 in which the parser adds a label column 715 to the catalogue and uses a user-specified label for the rows of data 710. That is, the user may specify the labels and then the parser acts on this specification. A zeroed array is pre-allocated to hold the membership labels and may be the same size as the data set. This increases the size of the memory allocated for the overall operation (i.e., the memory footprint for the catalogue), but only by a single column of raw data.

Each row may be identified by membership 720 as a training set 725 or a testing set 730. Training set data is used to train the model and testing set data is used to measure how well the created model performs at making predictions. Upon identifying whether the row belongs to the training or testing data set, the memberships for each row are added to the catalogue 735. With this optional step, the previously allocated memory footprint 530 (FIG. 5) and pre-allocated buffer 630 (FIG. 6) may increase, but the size may still not subsequently change during scanning and would thereby still be stable.
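
A sketch of the membership assignment is shown below; the array contents (1 for training, 2 for testing) and the assumption that the user-specified label values are literally "train" and "test" are illustrative rather than drawn from the disclosure.

```julia
# Illustrative label distribution: a zeroed array, one entry per row,
# records each row's membership (1 = training, 2 = testing) based on the
# user-specified label column. Label values are hypothetical.
function assign_membership(labels::Vector{String})
    membership = zeros(UInt8, length(labels))   # pre-allocated, same size as the data set
    for (i, label) in enumerate(labels)
        membership[i] = label == "train" ? 0x01 : 0x02
    end
    return membership
end
```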

There is a class of algorithms in machine learning, called “online models”, that may not require all of the training set data to be in memory at the same time, with the caveat that the generated model may suffer from lesser accuracy than standard predictive models. One of two optional steps may be performed. In one optional step, the parser extracts proxies for feature importance from the ingested raw data. In the other optional step, an online model is used for actual feature importance. For either option, the proxies or feature importance are still different from (though closely related to) the final feature importance from the final model which is created during the later stages of the machine learning pipeline (i.e., the machine learning pipeline 105 from FIG. 1).

The machine learning pipeline 105 typically expects the raw data to be split into training and testing sets. To prevent risk of polluting the training set, and thereby the quality of the final predictive model, rows in the testing set cannot be utilized in the online model. Therefore, if an online model is to be used, rows of data are assigned training or testing labels during scanning to ensure equal sizes of membership among the two.

To achieve this end, the parsing for label distribution step 150 is optionally utilized when online modeling is desired. This step determines the distribution of membership of the raw data. In typical implementations, a configuration file is written in a standard text editor, but in other embodiments a user interface could be configured to enable a user to specify the column location of the membership labels; the contents of the label column are read into a label array instead of the catalogue. The online model object manages its own memory allocation by pre-allocating the model beforehand, so the memory footprint of the online model object is stable with a relatively minor increase in the catalogue's size.

FIG. 8 shows an illustrative diagram in which the data is parsed into the catalogue 155. The parser scans 805 over the file byte-by-byte and each item of content within the columns is individually transferred to the pre-allocated buffer 630 for processing. The parser performs an action based on the type of column under which a respective cell lies (e.g., integer, floating-point, text string, date) and the content of the cell 810. In typical scenarios, the parser assembles the content (e.g., an item or parts of an item) into a number and pushes that number into an online statistic object within the catalogue 815, but other actions are also possible.
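
The loop below sketches this third pass for an all-numeric file with no header, reusing one buffer for every cell and pushing each assembled value into the corresponding column's statistic. It builds on the hypothetical OnlineStat helpers sketched earlier; quoting, missing values, and the per-type dispatch are omitted, and the parse/String conversion is a shortcut for the in-place numeric assembly described later.

```julia
# Illustrative third scan: each cell's bytes go into the single reusable
# buffer; at every delimiter the buffer contents are turned into a number
# and pushed into that column's online statistic.
function scan_into_catalogue!(stats::Vector{OnlineStat}, path::AbstractString,
                              buffer::Vector{UInt8})
    len = 0   # bytes currently held in the buffer
    col = 1   # current column index
    for b in read(path)                       # raw bytes of the file
        if b == UInt8(',') || b == UInt8('\n')
            if len > 0
                push_value!(stats[col], parse(Float64, String(buffer[1:len])))
            end
            len = 0
            col = (b == UInt8('\n')) ? 1 : col + 1
        else
            len += 1
            buffer[len] = b
        end
    end
    if len > 0                                # last cell if no trailing newline
        push_value!(stats[col], parse(Float64, String(buffer[1:len])))
    end
    return stats
end
```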

In scenarios in which parsing for label distribution is performed, the row contents may be submitted to an online model depending on the row's membership (i.e., training or testing set). This enables consistent usage of the rows of data when the catalogue is fed into the machine learning pipeline. The parser for this optional step operates similarly to the parsing of the content shown in FIG. 8 but skips all of the columns apart from the label column and copies the content in the pre-allocated buffer into the label array instead of the catalogue.

A proxy for the feature importance of the raw data within the catalogue can be assessed after parsing through each item of data. For example, columns having an online statistic result with a relatively high number can be more relevant than columns having an online statistic result with a relatively lower number. FIG. 9 shows an illustrative graph in which portions of data exhibit greater feature importance than other portions of data. There is a sharp increase in importance at the top of the list, but most features may not be important to the machine learning modeling. Typically, the feature importance reveals how important each column (e.g., a predictor) is in influencing the outcome of a question. The parser enables the identification of proxies for feature importance to feed into the machine learning pipeline and to discard unimportant features. This can increase the overall processing speed of the machine learning pipeline to generate a model of equal quality or, in some scenarios, increased quality by eradicating noisy or irrelevant features which can negatively affect the learning process. While the parser identifies a proxy for feature importance, the machine learning pipeline may further refine the feature importance for the final predictive model generated, so the final feature importance may not be the same as the data shown in FIG. 9.
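
As a sketch of how such a proxy could be applied (the use of variance as the statistic and the fixed threshold are assumptions for illustration only), columns whose collected statistic falls below a cutoff could simply be dropped from the set forwarded to the pipeline:

```julia
# Illustrative proxy filter: keep only the columns whose online statistic
# (here, variance) exceeds a threshold; the actual criterion may differ.
keep_columns(stats::Vector{OnlineStat}, threshold::Float64) =
    [j for (j, s) in enumerate(stats) if variance(s) > threshold]
```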

FIG. 10 shows an illustrative diagram in which the parsed information is used as a reference to load the raw data into the model for training (e.g., into the machine learning pipeline 105) in step 160. The raw data 205 may be loaded as a tabular data structure 1010 into the machine learning pipeline 105, in which the data loaded may use the catalogue information 1005 as a reference to determine which pieces of the raw data to load into the pipeline or disregard. The information from the catalogue can be used to pre-allocate a zeroed optimal tabular data structure in memory, for speed and to ensure there is sufficient free memory to fit the required raw data. Then the raw data from the disk is scanned over and, this time, loaded into the zeroed optimal tabular data structure instead of the catalogue. The machine learning pipeline 105 then proceeds as normal.
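
The following sketch shows the general shape of that load step under simplifying assumptions (numeric-only content, no header row, and a row count and column selection already recorded in the catalogue); the names are illustrative.

```julia
# Illustrative load step: the row count and the selected columns recorded
# in the catalogue size a zeroed matrix up front, which a final pass over
# the file then fills. Skipped columns are never materialized.
function load_selected(path::AbstractString, nrows::Int, keep::Vector{Int})
    table  = zeros(Float64, nrows, length(keep))   # zeroed tabular structure
    colmap = Dict(c => k for (k, c) in enumerate(keep))
    for (i, line) in enumerate(eachline(path))
        for (j, cell) in enumerate(split(line, ','))
            k = get(colmap, j, 0)
            k == 0 && continue
            table[i, k] = parse(Float64, cell)
        end
    end
    return table
end
```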

In contrast to a standard approach in which all of the data may be loaded, utilization of the catalogue enables the parser to skip over unnecessary data (thus reducing the needed size of the tabular data structure), to insert numerical data directly into the data structure without employing a conversion step, and to utilize a table to look up string values that references the mapping from the catalogue, as shown in FIG. 11, so that only codes for string values are loaded directly into the tabular data structure. For example, the meaning of a byte in the raw data can be determined using a table to look up a corresponding code for the byte. Columns with bytes representing categories with strings can be translated into a number to identify that string, such as “1” for NY, “2” for NJ, “3” for CT, and so on. The number is inserted directly into the tabular data structure without needing to load the string.
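
A minimal sketch of such a lookup is shown below; the dictionary-based encoding is an assumption (the disclosure's table is keyed by bytes, which this simplification glosses over). Each new string receives a small integer code, and repeat occurrences reuse it.

```julia
# Illustrative category lookup: each distinct string gets a small integer
# code, so only the code is inserted into the tabular data structure.
codes = Dict{String,Int}()
code_for(value::AbstractString) = get!(codes, String(value), length(codes) + 1)

code_for("NY")   # 1
code_for("NJ")   # 2
code_for("NY")   # 1 again; no new entry is created
```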

While numerical data can be processed directly into the data structure without conversion, negative data values are processed using associated flags. For example, a flag is set within the catalogue if a byte that corresponds to a negative sign is identified. In some scenarios, the negative sign is not loaded into the buffer to save processing time, but rather an additional operation on the cell's value may be performed, such as subtracting the cell's value from zero to effectively make the cell a negative number.

An additional variable is stored to identify a location of a decimal point inside a cell when the decimal point is detected in byte form. Similarly to the handling of the negative sign for numbers, in order to save time, the decimal point byte may not be loaded into memory, but rather the presence of the decimal point switches a flag which modifies the place value operation of the cell after the processing of the number.
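
Putting the two flags together, the sketch below assembles a number directly from digit bytes, treating the sign and decimal point as flags rather than buffered characters; it handles only plain decimal notation and is an illustration of the idea rather than the disclosed routine.

```julia
# Illustrative numeric assembly from bytes: digits accumulate into a value,
# while '-' sets a sign flag and '.' records where the fractional digits
# begin; both adjust the result after the digits are processed.
function assemble_number(bytes::AbstractVector{UInt8})
    value    = 0.0
    negative = false
    decimals = -1                       # -1 means no decimal point seen yet
    for b in bytes
        if b == UInt8('-')
            negative = true             # flag only; the byte is not accumulated
        elseif b == UInt8('.')
            decimals = 0                # start counting digits after the point
        else
            value = value * 10 + (b - UInt8('0'))
            decimals >= 0 && (decimals += 1)
        end
    end
    decimals > 0 && (value /= 10.0^decimals)
    negative && (value = -value)
    return value
end

assemble_number(codeunits("-12.5"))     # returns -12.5
```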

FIG. 12 shows a graph which illustrates the benefits provided by the parser 165 disclosed herein. Two experiments were performed in which data were loaded into memory from CSV files using various methods (a proxy for feature importance was not assessed), in which one experiment comprised 1.03 gigabytes (GB) of raw data and the other experiment comprised 266 megabytes (MB) of raw data. Several of the methods allocated relatively large amounts of memory as shown in the graph, whereas the present parser allocated approximately 360 kilobytes (kB) for both experimental files. As shown, the disclosed parser is more than ten times more efficient in terms of memory allocation relative to the most efficient open-source method used in this experiment (Method 7). The methods used for comparison listed in FIG. 12 are:

-   Method 1: Julia standard library eachline function, iterating by row with a JSON parser
-   Method 2: Julia standard library eachline function, iterating by row, with a standard library parser
-   Method 3: TextParse.jl, iterating by row
-   Method 4: CSV.jl File handler, iterating by whole file
-   Method 5: TextParse.jl, iterating by whole file
-   Method 6: Julia standard library eachline function, iterating by characters within row with a JSON parser
-   Method 7: CSV.jl File handler, iterating by row

In some scenarios, the parser runs in a single-threaded fashion, depending on the capabilities of the computing device. The parser can also work in a parallel fashion, expanding to the number of cores on the machine.

Parallelism is made possible, in part, due to the small memory footprint utilized. A parallel operation usually involves reserving areas of memory for each thread. The parser 165 scales well because the size of the footprint multiplied by the number of threads falls below the maximum system memory available. Various parallelism options are available with the parser 165.

A simplified parallelism process uses parallelism across files. Where the raw data is already split into multiple files, one parser can be run per file using its own catalogue, and the catalogues can then be merged when the parsing is complete. This allows, for example, in 8-thread operation, for eight files to be scanned in the same length of time as one file (depending on how the disk is accessed).
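
A sketch of this file-level parallelism is given below, reusing the hypothetical helpers from the earlier sketches (OnlineStat, largest_cell, scan_into_catalogue!, merge_stats); the orchestration shown is an assumption, not the disclosed implementation.

```julia
using Base.Threads

# Illustrative parallelism across files: each thread scans one file into
# its own per-column statistics, and the per-file results are merged once
# every file has been processed.
function parallel_scan(paths::Vector{String}, ncols::Int)
    partials = Vector{Vector{OnlineStat}}(undef, length(paths))
    @threads for i in eachindex(paths)
        stats  = [OnlineStat() for _ in 1:ncols]
        buffer = Vector{UInt8}(undef, largest_cell(paths[i]))
        scan_into_catalogue!(stats, paths[i], buffer)
        partials[i] = stats
    end
    # Merge the per-file catalogues column by column.
    return [reduce(merge_stats, [p[j] for p in partials]) for j in 1:ncols]
end
```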

Parallelizing within a file is a method in which data parallelism is performed by skipping rows. Each thread is given its own parser and catalogue, a starting row number, and an ending row number. The parser reads through the byte stream, skipping all content until its starting row number is reached, then starts reading the data into the buffer as normal until its ending row number is passed. When the file has been scanned, the catalogues from each thread are merged into one.

Parallelism can occur across machines. In one example, parallelizing using multiple machines involves splitting the data across the machines, preparing catalogues for each, and, once parsed, bringing the catalogues to a single machine and merging them. Each catalogue may be trivially small and quick to transfer across a network, since the catalogue only includes summary information and does not contain raw data.

FIG. 13 shows a flowchart of an illustrative method 1300 which may be implemented on a server or other computing device. Unless specifically stated, methods or steps shown in the flowcharts and described in the accompanying text are not constrained to a particular order or sequence. In addition, some of the methods or steps thereof can occur or be performed concurrently, and not all the methods or steps have to be performed in a given implementation depending on the requirements of such implementation; some methods or steps may be optionally utilized.

In step 1305, the computing device ingests raw data from a file, in which the raw data has demarcations between items of data. In step 1310, the computing device performs one or more scans on the file, in which one or more processes on the data are performed during the respective scans. In step 1315, the computing device allocates a memory footprint for a catalogue, wherein the memory footprint indicates a maximum amount of memory utilized during a machine learning optimization process. In step 1320, the computing device creates a pre-allocated buffer having a size which corresponds to a largest sized item within the raw data. In step 1325, the computing device individually transfers an item of the raw data into the created pre-allocated buffer. In step 1330, the computing device parses the item of the raw data into the catalogue responsive to the item of the raw data being transferred into the pre-allocated buffer. Each item is individually and sequentially transferred into the pre-allocated buffer and then parsed into the catalogue. In step 1335, the computing device loads the raw data into the machine learning pipeline for utilization by using the parsed catalogue as a reference.

FIG. 14 shows a flowchart of an illustrative method 1400 which may be performed by a computing device or server. In step 1405, the computing device determines a size of a memory footprint for a catalogue. In step 1410, the computing device allocates the determined size of the memory footprint for the catalogue, in which the allocated memory footprint is utilized during processing of raw data. In step 1415, the computing device individually transfers a demarcated item from the raw data into a buffer. In step 1420, the computing device parses the item of the raw data that is transferred into the buffer into the catalogue. Each demarcated item of the raw data is individually and sequentially transferred into the buffer and parsed into the catalogue. In step 1425, the computing device loads the raw data into a machine learning pipeline using the catalogue with the parsed items as a reference.

FIG. 15 shows a flowchart of an illustrative method 1500 which may be performed by a computing device or some remote service. In step 1505, the computing device ingests raw data from a file. In step 1510, the computing device performs a first scan on the raw data in the file, by which a memory footprint is allocated for a catalogue. In step 1515, responsive to completion of the first scan, the computing device performs a second scan on the raw data in the file, by which a pre-allocated buffer is created. In step 1520, responsive to completion of the second scan, the computing device performs a third scan on the raw data in the file, by which each item of the raw data is individually transferred into the pre-allocated buffer. In step 1525, responsive to completion of the third scan, the computing device loads the raw data, using summary information about the parsed items, into a machine learning pipeline for utilization, in which the catalogue is used as a reference for which pieces of raw data to load.

FIG. 16 shows an illustrative architecture 1600 for a computing device such as a laptop computer or personal computer for the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning. The architecture 1600 illustrated in FIG. 16 includes one or more processors 1602 (e.g., central processing unit, dedicated Artificial Intelligence chip, graphics processing unit, etc.), a system memory 1604, including RAM (random access memory) 1606 and ROM (read only memory) 1608, and a system bus 1610 that operatively and functionally couples the components in the architecture 1600. A basic input/output system containing the basic routines that help to transfer information between elements within the architecture 1600, such as during startup, is typically stored in the ROM 1608. The architecture 1600 further includes a mass storage device 1612 for storing software code or other computer-executed code that is utilized to implement applications, the file system, and the operating system. The mass storage device 1612 is connected to the processor 1602 through a mass storage controller (not shown) connected to the bus 1610. The mass storage device 1612 and its associated computer-readable storage media provide non-volatile storage for the architecture 1600. Although the description of computer-readable storage media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, it may be appreciated by those skilled in the art that computer-readable storage media can be any available storage media that can be accessed by the architecture 1600.

By way of example, and not limitation, computer-readable storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), Flash memory or other solid state memory technology, CD-ROM, DVD, HD-DVD (High Definition DVD), Blu-ray, or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage device, or any other medium which can be used to store the desired information and which can be accessed by the architecture 1600.

According to various embodiments, the architecture 1600 may operate in a networked environment using logical connections to remote computers through a network. The architecture 1600 may connect to the network through a network interface unit 1616 connected to the bus 1610. It may be appreciated that the network interface unit 1616 also may be utilized to connect to other types of networks and remote computer systems. The architecture 1600 also may include an input/output controller 1618 for receiving and processing input from a number of other devices, including a keyboard, mouse, touchpad, touchscreen, control devices such as buttons and switches or electronic stylus (not shown in FIG. 16). Similarly, the input/output controller 1618 may provide output to a display screen, user interface, a printer, or other type of output device (also not shown in FIG. 16).

It may be appreciated that the software components described herein may, when loaded into the processor 1602 and executed, transform the processor 1602 and the overall architecture 1600 from a general-purpose computing system into a special-purpose computing system customized to facilitate the functionality presented herein. The processor 1602 may be constructed from any number of transistors or other discrete circuit elements, which may individually or collectively assume any number of states. More specifically, the processor 1602 may operate as a finite-state machine, in response to executable instructions contained within the software modules disclosed herein. These computer-executable instructions may transform the processor 1602 by specifying how the processor 1602 transitions between states, thereby transforming the transistors or other discrete hardware elements constituting the processor 1602.

Encoding the software modules presented herein also may transform the physical structure of the computer-readable storage media presented herein. The specific transformation of physical structure may depend on various factors in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the computer-readable storage media, whether the computer-readable storage media is characterized as primary or secondary storage, and the like. For example, if the computer-readable storage media is implemented as semiconductor-based memory, the software disclosed herein may be encoded on the computer-readable storage media by transforming the physical state of the semiconductor memory. For example, the software may transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. The software also may transform the physical state of such components in order to store data thereupon.

As another example, the computer-readable storage media disclosed herein may be implemented using magnetic or optical technology. In such implementations, the software presented herein may transform the physical state of magnetic or optical media, when the software is encoded therein. These transformations may include altering the magnetic characteristics of particular locations within given magnetic media. These transformations also may include altering the physical features or characteristics of particular locations within given optical media to change the optical characteristics of those locations. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this discussion.

The architecture 1600 may further include one or more sensors 1614 or a battery or power supply 1620. The sensors may be coupled to the architecture to pick up data about an environment or a component, including temperature, pressure, etc. Exemplary sensors can include a thermometer, accelerometer, smoke or gas sensor, pressure sensor (barometric or physical), light sensor, ultrasonic sensor, gyroscope, among others. The power supply may be adapted with an AC power cord or a battery, such as a rechargeable battery for portability.

In light of the above, it may be appreciated that many types of physical transformations take place in the architecture 1600 in order to store and execute the software components presented herein. It also may be appreciated that the architecture 1600 may include other types of computing devices, including wearable devices, handheld computers, embedded computer systems, smartphones, PDAs, and other types of computing devices known to those skilled in the art. It is also contemplated that the architecture 1600 may not include all of the components shown in FIG. 16, may include other components that are not explicitly shown in FIG. 16, or may utilize an architecture completely different from that shown in FIG. 16.

FIG. 17 is a simplified block diagram of an illustrative computer system 1700 such as a PC or server with which the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning may be implemented. Computer system 1700 includes a processor 1705, a system memory 1711, and a system bus 1714 that couples various system components including the system memory 1711 to the processor 1705. The system bus 1714 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, or a local bus using any of a variety of bus architectures. The system memory 1711 includes read only memory (ROM) 1717 and random access memory (RAM) 1721. A basic input/output system (BIOS) 1725, containing the basic routines that help to transfer information between elements within the computer system 1700, such as during startup, is stored in ROM 1717. The computer system 1700 may further include a hard disk drive 1728 for reading from and writing to an internally disposed hard disk (not shown), a magnetic disk drive 1730 for reading from or writing to a removable magnetic disk 1733 (e.g., a floppy disk), and an optical disk drive 1738 for reading from or writing to a removable optical disk 1743 such as a CD (compact disc), DVD (digital versatile disc), or other optical media. The hard disk drive 1728, magnetic disk drive 1730, and optical disk drive 1738 are connected to the system bus 1714 by a hard disk drive interface 1746, a magnetic disk drive interface 1749, and an optical drive interface 1752, respectively. The drives and their associated computer-readable storage media provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computer system 1700. Although this illustrative example includes a hard disk, a removable magnetic disk 1733, and a removable optical disk 1743, other types of computer-readable storage media which can store data that is accessible by a computer, such as magnetic cassettes, Flash memory cards, digital video disks, data cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in some applications of the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning. In addition, as used herein, the term computer-readable storage media includes one or more instances of a media type (e.g., one or more magnetic disks, one or more CDs, etc.). For purposes of this specification and the claims, the phrase “computer-readable storage media” and variations thereof are intended to cover non-transitory embodiments, and do not include waves, signals, and/or other transitory and/or intangible communication media.

A number of program modules may be stored on the hard disk, magnetic disk 1733, optical disk 1743, ROM 1717, or RAM 1721, including an operating system 1755, one or more application programs 1757, other program modules 1760, and program data 1763. A user may enter commands and information into the computer system 1700 through input devices such as a keyboard 1766 and pointing device 1768 such as a mouse. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, trackball, touchpad, touchscreen, touch-sensitive device, voice-command module or device, user motion or user gesture capture device, or the like. These and other input devices are often connected to the processor 1705 through a serial port interface 1771 that is coupled to the system bus 1714, but may be connected by other interfaces, such as a parallel port, game port, or universal serial bus (USB). A monitor 1773 or other type of display device is also connected to the system bus 1714 via an interface, such as a video adapter 1775. In addition to the monitor 1773, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The illustrative example shown in FIG. 17 also includes a host adapter 1778, a Small Computer System Interface (SCSI) bus 1783, and an external storage device 1776 connected to the SCSI bus 1783.

The computer system 1700 is operable in a networked environment using logical connections to one or more remote computers, such as a remote computer 1788. The remote computer 1788 may be selected as another personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer system 1700, although only a single representative remote memory/storage device 1790 is shown in FIG. 17. The logical connections depicted in FIG. 17 include a local area network (LAN) 1793 and a wide area network (WAN) 1795. Such networking environments are often deployed, for example, in offices, enterprise-wide computer networks, intranets, and the Internet.

When used in a LAN networking environment, the computer system 1700 is connected to the local area network 1793 through a network interface or adapter 1796. When used in a WAN networking environment, the computer system 1700 typically includes a broadband modem 1798, network gateway, or other means for establishing communications over the wide area network 1795, such as the Internet. The broadband modem 1798, which may be internal or external, is connected to the system bus 1714 via a serial port interface 1771. In a networked environment, program modules related to the computer system 1700, or portions thereof, may be stored in the remote memory storage device 1790. It is noted that the network connections shown in FIG. 17 are illustrative and other means of establishing a communications link between the computers may be used depending on the specific requirements of an application of the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning.

Various exemplary embodiments of the present high-speed scanning parser for scalable collection of statistics and use in preparing data for machine learning are now presented by way of illustration and not as an exhaustive list of all embodiments. An example includes a method performed by a computing device for optimization of a machine learning pipeline, comprising: ingesting raw data from a file, in which the raw data has demarcations between items of data; performing one or more scans on the file, in which the one or more scans are used to: allocate a memory footprint for a catalogue, wherein the memory footprint indicates a maximum amount of memory utilized during the machine learning pipeline optimization process to avoid performance of additional memory allocation operations during subsequent processing, create a pre-allocated buffer having a size which corresponds to a largest sized item within the raw data, individually transfer an item of the raw data into the created pre-allocated buffer, and parse the transferred item of the raw data into the catalogue responsive to the item of the raw data being transferred into the pre-allocated buffer, wherein each item of the raw data is individually and sequentially transferred into the pre-allocated buffer and parsed into the catalogue; and load the raw data into the machine learning pipeline for utilization by using the parsed catalogue as a reference.

In another example, the raw data in the file is in CSV (comma-separated values) format. In another example, the one or more scans on the file are performed on bytes of the raw data. In another example, the items of data are demarcated by a delimiter. In another example, the delimiter is a comma, tab, or pipe. In another example, the items of the raw data are transferred in byte form into the pre-allocated buffer. In another example, the memory footprint is allocated based on a size of the catalogue, wherein a construction of the catalogue includes holding online statistic objects for each column of the raw data, in which the online statistic objects accept data from a respective column's population. In another example, parsing the items of the raw data includes assembling the items within the pre-allocated buffer into a number and pushing the number to a corresponding online statistic object inside the catalogue. In another example, the allocated memory footprint is comprised of pre-allocated arrays for individual objects, including the online statistic objects and flags identified for each column.

A further example includes a computing device configured to parse raw data for use by a machine learning pipeline, comprising: one or more processors; and one or more hardware-based memory devices having instructions which, when executed by the one or more processors, cause the computing device to: determine a size of a memory footprint for a catalogue; allocate the determined size of the memory footprint for the catalogue, in which the allocated memory footprint is utilized during processing of raw data to reduce a number of times the memory footprint is re-allocated in memory; individually transfer a demarcated item from the raw data into a buffer; parse the item of the raw data that is transferred into the buffer into the catalogue, wherein each demarcated item of the raw data is individually and sequentially transferred into the buffer and parsed into the catalogue; and load the raw data into a machine learning pipeline using the catalogue with the parsed items as a reference.

In another example, the buffer is a pre-allocated buffer. In another example, the executed instructions further cause the computing device to parse for delimiters in the raw data that demarcate the items, identify a largest item between delimiters, and create the pre-allocated buffer using a size based on the identified largest item. In another example, the parsing for delimiters and the identification of the largest items are performed on the raw data in byte format. In another example, the executed instructions further cause the computing device to parse the raw data for label distribution, in which the label distribution includes designating user-specified rows of raw data as a training set or a testing set for utilization by the machine learning pipeline, wherein the parsing is performed during a scan in which the computing device parses the raw data for the delimiters and identifies the largest item in the raw data. In another example, the parsed individual items are assembled into a number and the items or parts of the items are pushed to a relevant online statistic object in the catalogue, and wherein the online statistic object is utilized to determine which raw data from the catalogue to load into the machine learning pipeline. In another example, the catalogue is loaded into the machine learning pipeline in a tabular data structure.

A further example includes one or more hardware-based non-transitory computer-readable memory devices storing instructions which, when executed by one or more processors disposed in a computing device, cause the computing device to: ingest raw data from a file; perform a first scan on the raw data in the file, by which a memory footprint is allocated for a catalogue; responsive to completion of the first scan, perform a second scan on the raw data in the file by which a pre-allocated buffer is created, wherein a size of the pre-allocated buffer is based on a largest item of raw data in the file, in which items within the raw data are demarcated by delimiters; responsive to completion of the second scan, perform a third scan on the raw data in the file, by which each item of the raw data is individually transferred into the created pre-allocated buffer and parsed into the catalogue; and responsive to completion of the third scan, load the raw data into a machine learning pipeline for utilization, in which the catalogue is used as a reference for which pieces of raw data to load.

In another example, each scan is performed on the raw data as bytes. In another example, items are assembled and pushed into an online statistics object within the catalogue which informs which data within the catalogue is to be loaded into the machine learning pipeline. In another example, the online statistics object includes one or more of variance or prevalence.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

What is claimed:
 1. A computer-implemented method comprising: allocating a buffer corresponding to one or more sizes of data in a set of data; transferring the set of data into the allocated buffer; parsing the set of data in the allocated buffer to identify a first set of features and a second set of features within the set of data, wherein the parsing of the set of data produces a catalogue, and wherein a subset of the set of data is identified based on the first set of features; and training a machine-learning system using the identified subset of data.
 2. The method of claim 1, wherein the catalogue comprises data characteristics to identify the first set of features.
 3. The method of claim 1, further comprising: determining whether the first set of features exceeds a threshold level.
 4. The method of claim 1, further comprising: utilizing the catalogue to reserve memory for processing of the set of data.
 5. The method of claim 1, wherein a utilization of online statistics of the first set of features reduces memory allocations.
 6. The method of claim 1, further comprising: utilizing the catalogue to identify whether the first set of data features or second set of data features exceeds a threshold level.
 7. The method of claim 1, wherein the catalogue is produced based on one or more scans of the set of data.
 8. A computer program product comprising a tangible storage medium encoded with processor-readable instructions that, when executed by one or more processors, enable the computer program product to: allocate a buffer corresponding to one or more sizes of data in a set of data; transfer the set of data into the allocated buffer; parse the set of data in the allocated buffer to identify a first set of features and a second set of features within the set of data, wherein the parsing of the set of data produces a catalogue, and wherein a subset of the set of data is identified based on the first set of features; and train a machine-learning system using the identified subset of data.
 9. The computer program product of claim 8, wherein the catalogue is used to identify the first set of features to be placed into the machine-learning system.
 10. The computer program product of claim 8, wherein the catalogue enables a parser to determine if the first set of features exceeds a threshold level.
 11. The computer program product of claim 8, wherein the catalogue production includes collecting statistics within the set of data.
 12. The computer program product of claim 8, wherein contents of the catalogue are accessed in relation to processing capability.
 13. The computer program product of claim 8, wherein the catalogue production creates an allocated memory footprint.
 14. The computer program product of claim 8, wherein the set of data is labeled within the catalogue.
 15. A computer system connected to a network, the system comprising: one or more processors configured to: allocate a buffer corresponding to one or more sizes of data in a set of data; transfer the set of data into the allocated buffer; parse the set of data in the allocated buffer to identify a first set of features and a second set of features within the set of data, wherein the parsing of the set of data produces a catalogue, and wherein a subset of the set of data is identified based on the first set of features; and train a machine-learning system using the identified subset of data.
 16. The computer system of claim 15, wherein a determination is made as to whether the first set of features exceeds a threshold level.
 17. The computer system of claim 15, wherein the catalogue is used to identify differences between the first set of features and the second set of features.
 18. The computer system of claim 15, wherein the catalogue production reduces one or more memory allocations.
 19. The computer system of claim 15, wherein the catalogue production enables statistics and/or one or more flags within the set of data to be collected.
 20. The computer system of claim 15, wherein the parser identifies when the first or second set of features passes a threshold level.